Building happiness

Learnings and choices made along the way

Building the happiness demo was the first time I tried to integrate the new bokeh server (released in v0.11) with django. Having developed a lot of django web applications, I wanted to prove out a number of uses that I thought were important to demonstrate with bokeh server.

Along the way I learned some things about how the bokeh server works, learned about different choices I could make in the implementation, and also learned some things that we could improve in the server!

Given that everyone's use case is different, I thought it would be useful to document this learning.

Use cases

With the happiness demo, I wanted to build a system that:

  • had private data - users must be logged in to see their data (and their plots)
  • was a django application with different types of users - different users have access to different types of data
  • plots updated in real-time - when one user updated their data, that update is automatically pushed down to other users
  • showed a bokeh server and a django server deployed separately, so they could be deployed on different servers

First implementation

The first implementation of happiness is tagged with happiness_v1 - https://github.com/bokeh/bokeh-demos/tree/happiness_v1/happiness

It's worth noting that in this first implementation I didn't actually have users log in, as this makes exploring the app take more time. Clicking on a user in the left menu was treated like a login - but the exact same setup would have worked had logins been used.

How it worked

  • bokeh server was serving up four empty plots (individual, individuals, team, teams)
    • these have all the code to make the plots, but an empty data source for the plot
    • they all had an additional datasource user_pk_source that was used to pass the user_id from django to bokeh
  • when a user logged into django and was presented with their plots the following happened:
    • a call was made to pull_session which caused bokeh to generate a new session
    • django then populated this session with the user id of the logged in user that the session was for
    • bokeh then used that user id to get the user from the database along with their associated data
    • bokeh then polled the database every 5 seconds for new data for that user in a periodic_callback

Things learned with First implementation

Use pull_session with no session_id to have bokeh generate one for you

The following code asks bokeh server to create a new session, with a unique generated session id, for the plot script in individual.py.

bokeh_session = pull_session(session_id=None, url='http://localhost:5006/individual/')

Where bokeh plots are served from is limited

To make my plots available I launched bokeh server with bokeh serve viz/individual.py viz/individuals.py. This resulted in plots being available at http://localhost:5006/individual/ and http://localhost:5006/individuals/.

There is no way to change those urls, as they are based on the file name, although you can add a prefix with the --prefix option passed to bokeh serve.
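
The naming rule can be sketched in a few lines. This is only an illustration of the behaviour described above, not code from bokeh itself; the helper name app_url and its arguments are hypothetical.

```python
import posixpath


def app_url(script_path, host="localhost:5006", prefix=None):
    """Guess the URL at which `bokeh serve` exposes a single-file app."""
    # The route is the file name without its directory or ".py" extension.
    name = posixpath.splitext(posixpath.basename(script_path))[0]
    # An optional prefix (as given to --prefix) goes in front of the app name.
    parts = [part.strip("/") for part in (prefix or "", name)]
    path = "/".join(part for part in parts if part)
    return "http://%s/%s/" % (host, path)


print(app_url("viz/individual.py"))
print(app_url("viz/individuals.py", prefix="plots"))
```

A helper like this is mainly useful on the django side, so that the urls passed to pull_session are built in one place rather than hard-coded in each view.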

In addition, if you use the bokeh app folder structure http://bokeh.pydata.org/en/0.11.0/docs/user_guide/server.html#directory-format then you can only have one plot in each directory.

You need to program defensively

While something like django's template engine is very forgiving, bokeh server isn't. Users aren't going to get any helpful message if the plot fails to load, and django isn't going to know that the plot is going to fail to load. As a result it's important to make sure all inputs are available for bokeh and that appropriate fallbacks are in place.
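
One way to approach this is to wrap whatever fetches the plot data so that a failure produces known-good fallback data instead of an exception. The following is a minimal sketch in plain Python; load_plot_data, fetch and EMPTY are hypothetical names and are not part of the happiness demo.

```python
# Fallback used when the real data cannot be loaded (hypothetical column names).
EMPTY = {"date": [], "happiness": []}


def load_plot_data(fetch, fallback):
    """Return fetch() if it yields a dict of equal-length columns, else fallback."""
    try:
        data = fetch()
        # A ColumnDataSource expects every column to have the same length.
        lengths = {len(column) for column in data.values()}
    except Exception:
        # Missing user, database error, wrong type: fall back rather than fail.
        return fallback
    return data if len(lengths) <= 1 else fallback


def broken_fetch():
    raise RuntimeError("database unavailable")


print(load_plot_data(broken_fetch, EMPTY))
print(load_plot_data(lambda: {"date": [1, 2], "happiness": [7, 8]}, EMPTY))
```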

autoload_server has a lot of options

There are a number of examples of using autoload_server in the bokeh examples, but there are actually a lot of different ways to use it, so it's worth checking out the docs: http://bokeh.pydata.org/en/0.11.0/docs/reference/embed.html#bokeh.embed.autoload_server

There's no way (yet) to pass a bokeh session a variable

In this implementation I had to set up a ColumnDataSource called user_pk_source to pass the user_id (or object.pk) to bokeh. The data of a ColumnDataSource is always a dictionary of lists. This meant passing the user_id in the form:

{
    'user_pk': [user_id]
}

The code that did this is:

user_source = bokeh_session.document.get_model_by_name('user_pk_source')
user_source.data = dict(user_pk=[self.object.pk])

This was pretty clunky. What would have been nice is some kind of custom session attribute so we could do:

# Warning - does not currently work
bokeh_session.user_pk = self.object.pk

There is an open issue https://github.com/bokeh/bokeh/issues/3349 to add this feature.
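
Until something like that exists, the bokeh side has to unwrap the value from the dictionary of lists again, and it should cope with the value not having arrived yet. A minimal sketch, assuming the data has the {'user_pk': [user_id]} shape shown earlier; read_user_pk is a hypothetical helper, not code from the demo.

```python
def read_user_pk(data):
    """Return the user id stored as {'user_pk': [user_id]}, or None if absent."""
    values = data.get("user_pk") or []
    # The source starts out empty until django has populated the session.
    return values[0] if len(values) else None


print(read_user_pk({"user_pk": [42]}))
print(read_user_pk({"user_pk": []}))
```

In the bokeh script this would be called with user_pk_source.data, and a None result would mean "no user yet, keep the plot empty".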

It's worth noting that the method document.get_model_by_name was useful in making this relatively concise and transparent. For this to work though you must add the source to the document so that django can access it:

document.add_root(plot)
document.add_root(user_pk_source)

The order that you add things to the Document is (was) important

As noted above, I needed to add the user_pk_source to the Document so that I could access it with the document.get_model_by_name method. However, it turned out that the order in which you add things to the Document was very important. The plot had to be added first in order to ensure that the plot rendered correctly. There is an open pull request fixing this constraint: https://github.com/bokeh/bokeh/pull/3624

Bokeh accessing Django ORM

To allow bokeh to use the Django ORM, I needed to start the bokeh server with the django settings available. The bokeh server was started with the command

PYTHONPATH=$PWD DJANGO_SETTINGS_MODULE=webapp.settings bokeh serve viz/individual.py viz/individuals.py viz/team.py viz/teams.py --log-level=info --host=localhost:5006 --host=localhost:8001

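In addition, I needed bokeh to make a call to django.setup() once. Each plot declaration had code along these lines:

import django
from viz import utils

# Rebinding a name imported with "from viz.utils import django_setup" would
# not update the flag in viz.utils, so the flag is set on the module itself.
if not utils.django_setup:
    django.setup()
    utils.django_setup = True

Use --host argument to give django access to bokeh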

The following command was used to start bokeh server. In it we give the django server (which is running on port 8001) the ability to access bokeh server, as well as giving the bokeh server permission to access itself.

PYTHONPATH=$PWD DJANGO_SETTINGS_MODULE=webapp.settings bokeh serve viz/individual.py viz/individuals.py viz/team.py viz/teams.py --log-level=info --host=localhost:5006 --host=localhost:8001

Add a timeout callback to initially render the plot

The periodic callback in bokeh was set to 5 seconds so that the database wasn't being hit too often. To set up the periodic callback I did:

document.add_periodic_callback(update_data, 5000)

However, this delayed loading of the initial plot. To compensate for this I added the following

def update_data_once():
    update_data()

document.add_timeout_callback(update_data_once, 250)

add_timeout_callback adds a callback to be invoked once, after a given amount of time - 250ms in this case.

What I wanted to do was just:

# Warning - does not currently work
document.add_timeout_callback(update_data, 250)
document.add_periodic_callback(update_data, 5000)

Unfortunately this doesn't work at the moment and bokeh kicks up an error because we've tried to add the same callback method twice. As a result I needed to add the dummy method update_data_once.

Legends only render once in bokeh

In this demo, plots have legends that change based on the user's data. It turns out that the legend layout calculation is done only once. This means that if you're not careful, you can end up with your legend being laid out poorly, with text overlapping the legend glyph. The main change I made that seemed to prevent this was giving a modest delay to add_timeout_callback. I didn't dig into this much; I just got something working.

Things I would have improved

Had I not scrapped this implementation I would have made some improvements.

I should have been closing sessions in my django code

The django code that populated the bokeh session was:

bokeh_session = pull_session(session_id=None, url='http://localhost:5006/%s/' % suffix)
user_source = bokeh_session.document.get_model_by_name('user_pk_source')
user_source.data = dict(user_pk=[self.object.pk])
script = autoload_server(None, app_path='/%s' % suffix, session_id=bokeh_session.id)

This was leaving bokeh sessions open. I should have cleaned up after myself by calling bokeh_session.close() as soon as I had finished changing the data:

bokeh_session = pull_session(session_id=None, url='http://localhost:5006/%s/' % suffix)
user_source = bokeh_session.document.get_model_by_name('user_pk_source')
user_source.data = dict(user_pk=[self.object.pk])
bokeh_session.close()
script = autoload_server(None, app_path='/%s' % suffix, session_id=bokeh_session.id)

Only send new data

Bokeh does not currently have a way to send partial updates of a data source. That means that if the data source is changed, the whole data goes down the web socket again. My data was being updated on every periodic callback (every 5 seconds). This wastes bandwidth and I'm not sure what it would have done to performance on a slow connection.

Partial patching of data sources is in the works for bokeh, but in the meantime I could have checked whether the data had changed and updated the data source only if it had.
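
The check itself is small. The sketch below uses a stand-in object with a .data attribute rather than a real ColumnDataSource, and assumes the columns are plain Python lists (numpy arrays would need a different comparison); update_if_changed is a hypothetical helper.

```python
from types import SimpleNamespace


def update_if_changed(source, fresh_data):
    """Assign fresh_data to source.data only if it differs; return True if assigned."""
    if source.data == fresh_data:
        # Nothing new: skip the assignment so nothing is re-sent to the browser.
        return False
    source.data = fresh_data
    return True


# Stand-in for a data source; anything with a `.data` dict works here.
source = SimpleNamespace(data={"date": [1, 2], "happiness": [7, 8]})

print(update_if_changed(source, {"date": [1, 2], "happiness": [7, 8]}))
print(update_if_changed(source, {"date": [1, 2, 3], "happiness": [7, 8, 9]}))
```

This still queries the database on every callback; it only avoids pushing unchanged data down the web socket.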

There may have been other optimizations - such as only getting the data for the timeframe the user was looking at.

Security - only allow django to make sessions

Bokeh server comes with a number of features that allow you to ensure that sessions are only made when you want them to be. These are tucked away in the docs here: http://bokeh.pydata.org/en/0.11.0/docs/user_guide/cli.html#session-id-options

If a user who wasn't logged in tried to hit my bokeh server directly they would only be served a blank plot, which is probably fine, but using these options would have improved security further. Given that bokeh already had access to django's settings, they probably could have shared the django secret key.

Here's some more context from a key author of the new bokeh server

The purpose of the signed session ID is to keep someone from connecting to the bokeh server directly (without django app) and getting a session. As you say, if the sessions are empty anyway until your Django app fills them in, it's harmless (other than resource usage) for people to connect, so you wouldn't have to use external-signed mode. But say for example your sessions did have some type of proprietary information (perhaps not per-user info, just info that should only be accessible to someone who's logged in or someone who's a member of a certain group), and that info was in the session by default - then you might want external-signed so that Django could control access to the bokeh server.

I would think using external-signed is a good default practice in this sort of app, since it's more locked down and it presumably isn't useful to visit the bokeh server directly.

source: https://github.com/bokeh/bokeh-demos/pull/13#issuecomment-169482037
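
To illustrate the idea of a signed ID, here is a sketch using only the Python standard library. It is not bokeh's actual session ID format or API, and the function names are hypothetical; it only shows how a secret shared between django and the plot server lets the server reject IDs that django did not issue.

```python
import base64
import hashlib
import hmac
import os


def _sign(value, secret_key):
    """Return a URL-safe HMAC-SHA256 signature of value."""
    digest = hmac.new(secret_key.encode("utf-8"), value.encode("utf-8"),
                      hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def make_signed_id(secret_key):
    """Create a random ID followed by '.' and its signature."""
    raw = base64.urlsafe_b64encode(os.urandom(32)).decode("ascii").rstrip("=")
    return raw + "." + _sign(raw, secret_key)


def is_valid_signed_id(signed_id, secret_key):
    """Check that the signature matches; only holders of the key can forge one."""
    raw, separator, signature = signed_id.partition(".")
    if not separator:
        return False
    expected = _sign(raw, secret_key)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8"))


session_id = make_signed_id("shared-secret")
print(is_valid_signed_id(session_id, "shared-secret"))
print(is_valid_signed_id(session_id, "wrong-secret"))
```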

Improved fallbacks

If I were putting this into production, I would probably have had the bokeh plot fall back to a text glyph saying "I'm sorry, there was a problem loading your plot" or something similar, to give a better user experience if the plot failed to load.

Problems with this implementation and what I subsequently learned

The problem - hitting the database unnecessarily

The big problem that I saw with this implementation was that every session was hitting the database every 5 seconds. Although this was fine at small scale, it seemed like a waste of resources and unlikely to scale well. In addition, knowing that django has save signals such as post_save - meaning I could take an action when data was updated - I wanted to do better.

After talking with some people I learned some more fundamentals about how bokeh server works and that there are better ways of setting this particular example up.

What I learned

I had been thinking of a Session as an instance of a Document. It's not. A Session is just a container that can hold a Document. It can hold any Document and you don't have to declare that Document in the bokeh app code! This may be obvious to some, but was a bit of a revelation to me. I had only seen examples that used the Document or curdoc and then served up that file with bokeh serve and I hadn't connected all the dots.

What connected the dots was Havoc highlighting the difference between pull_session and push_session. Both create a session. But with pull_session you get the state from the bokeh server, and with push_session you give the state to the bokeh server. The state is an instance of Document. The key is then deciding which side is authoritative. If bokeh server is authoritative then you would want to use pull_session to get the session from bokeh and then modify it how you need to. On the other hand, if django is authoritative then using push_session you can just build the Document in django with the data from django and push it to the server.

The main problem I wanted to fix was to only have data updated when there was new data. From a django perspective, this means bokeh only updating when a save event occurs which can be done with the post_save signal. So the idea is:

  • instead of bokeh checking the database every 5 seconds
  • have django update all the bokeh sessions' data in a post_save signal

There are a number of ways of doing this. In my second implementation I am going for the simplest, but there are other ways, and these are discussed at the end of this notebook.
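
The fan-out from a save event to the affected sessions can be sketched without django or bokeh. In the sketch below, SessionRegistry is a hypothetical class; in a real app notify_saved would be called from a post_save receiver, and the update_session callable would be whatever code fetches the session from bokeh server and changes its data. Both of those are assumptions and are not shown.

```python
from collections import defaultdict


class SessionRegistry(object):
    """Remembers which bokeh session ids are showing each user's data."""

    def __init__(self):
        self._by_user = defaultdict(set)

    def register(self, user_pk, session_id):
        self._by_user[user_pk].add(session_id)

    def notify_saved(self, user_pk, new_data, update_session):
        """Call update_session(session_id, new_data) for each of the user's sessions."""
        session_ids = sorted(self._by_user.get(user_pk, ()))
        for session_id in session_ids:
            update_session(session_id, new_data)
        return len(session_ids)


registry = SessionRegistry()
registry.register(1, "session-a")
registry.register(1, "session-b")
registry.register(2, "session-c")

updated = []
count = registry.notify_saved(1, {"happiness": [7]},
                              lambda sid, data: updated.append(sid))
print(count, updated)
```

Only the sessions belonging to the saved user are touched, so nothing polls the database when no data has changed.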

Note: "First implementation" could still be a good set-up

Although I then re-wrote the example, it's worth noting that I can imagine scenarios where I would still use a setup like this. With the bokeh server still new, we don't yet have enough real world experience of what works well and what doesn't for scalable web-facing (not on a private network) applications. Please do share your experiences on the bokeh mailing list: https://groups.google.com/a/continuum.io/forum/#!forum/bokeh

Second Implementation

How it works

In my second implementation, bokeh-server will be very naive and django will do all the work.

Django will push all the data to bokeh sessions, and when there is new data django will push it to bokeh server. Bokeh server will then just focus on its job of updating the attached clients and pushing that new data down to them.

This cleans up a number of things, but does mean that in django I will need to keep track of bokeh sessions.

Given that django already has sessions, I can just add the bokeh session ids to the user's session and they'll be readily available for me to update.
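
A minimal sketch of that bookkeeping follows, using a plain dict in place of the dict-like request.session. The key name and helper are hypothetical. The mapping is reassigned rather than mutated in place because django only notices a session change when a key is assigned or deleted, not when a nested object is modified.

```python
BOKEH_SESSIONS_KEY = "bokeh_session_ids"  # hypothetical key name


def remember_bokeh_session(session, plot_name, session_id):
    """Record the bokeh session id used for plot_name in a dict-like session."""
    # Copy, update, then reassign so the session backend sees the change.
    ids = dict(session.get(BOKEH_SESSIONS_KEY, {}))
    ids[plot_name] = session_id
    session[BOKEH_SESSIONS_KEY] = ids
    return ids


fake_session = {}  # stand-in for request.session
remember_bokeh_session(fake_session, "individual", "abc123")
remember_bokeh_session(fake_session, "team", "def456")
print(fake_session)
```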

Two options for updating a bokeh session

It's worth noting that there are two options for updating a bokeh session and which one you pick depends on your use case. You can either store the session_id or keep the Session object. If you keep the Session object, you'll be keeping the sessions open between django and bokeh. This has the advantage that you don't need to download the document again when you want to update the data, but the disadvantage that you'll be keeping all the sessions open. If you just store the session_id then you'll need to use pull_session to download the Document from bokeh server and then update the data.

Given that in the happiness demo I don't expect data updates to be too frequent, I'm going with the second option of storing the session_id and not keeping the session open. I'm not sure this is the correct choice - please do share your experiences.

What this cleans up

In doing it this way, my bokeh server is a completely empty implementation. There is no need for the four instances of individual, individuals, team, teams. All the bokeh code will live under my django views and django will push the plots to bokeh. This should make everything cleaner and easier to follow what's going on as the code isn't in two places (under bokeh or django). It also means that I don't need to restart the bokeh server any time I want to make a change to a bokeh plot.

